Use of Native Xpath - Faster calculate when big data #906

mgogh · 2022-06-10T13:42:13Z

If you create a form with select_one_from_file based on a big CSV file (more than 3000 rows and 10 columns), and then add ten calculates, one for each column, it will be very slow. (The select_one is filtered by another question.)

Example: In France, we have 34 000 communes, which can be filtered by region and department.

Always trying the native XPath approach first speeds up value lookups when the model is big.
If expr contains custom OpenRosa functions, it will fall back to the fork as expected (jsEvaluate), which is slower than the native method.

… are fat

eyelidlessness · 2022-06-10T17:46:04Z

Hi @mgogh, thanks for this PR. I arrived at a similar approach in a project I'm using for exploring/prototyping a variety of performance improvements, but hadn't gotten around to getting it into a PR. In my prototype, I arrived at a few additional optimizations.

I found that pre-compiling expressions with document.createExpression performs better than calling document.evaluate directly. (Surprisingly, it performs better even if you discard the expression and recompile on each evaluation, at least in Chrome.)
Wrapping both cases in a class of the same shape performs better still, and catching the error on construction rather than evaluation performs much better still. This is partly because classes with a consistent shape are good JIT optimization targets, and partly because the try/catch branching is much more minimal and predictable.
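
A minimal sketch of the idea described above, not the actual prototype code: both the native and the fallback path are wrapped in classes exposing the same `evaluate` method, and the try/catch happens once, at construction. The names `NativeExpression`, `FallbackExpression`, `compile` and `fallbackEvaluate` are hypothetical; only `document.createExpression` and `XPathExpression.evaluate` are standard DOM APIs. It assumes, as the comment suggests, that the native compiler throws for expressions it cannot handle.

```javascript
// Hypothetical wrapper for expressions the browser can compile natively.
class NativeExpression {
    constructor(doc, expr, resolver) {
        // Assumed to throw here (at construction) if the native compiler
        // rejects the expression, e.g. because of a custom OpenRosa function.
        this.compiled = doc.createExpression(expr, resolver);
    }

    evaluate(contextNode, resultType) {
        return this.compiled.evaluate(contextNode, resultType, null);
    }
}

// Hypothetical wrapper with the same method signature for everything else.
class FallbackExpression {
    constructor(expr, fallbackEvaluate) {
        this.expr = expr;
        this.fallbackEvaluate = fallbackEvaluate;
    }

    evaluate(contextNode, resultType) {
        return this.fallbackEvaluate(this.expr, contextNode, resultType);
    }
}

// The only try/catch: decide once which wrapper to use, then reuse it.
function compile(doc, expr, resolver, fallbackEvaluate) {
    try {
        return new NativeExpression(doc, expr, resolver);
    } catch (error) {
        return new FallbackExpression(expr, fallbackEvaluate);
    }
}
```

In a browser, `doc` would be the model's XML document and `fallbackEvaluate` whatever slower evaluator is available; the returned object can be cached per expression so the compile cost is paid once.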

I'll take some time to bring the pertinent prototype code in for a PR. In the meantime, would you be able to share a form like the one you described? I'd like to add it to my collection of performance-related forms.

MartijnR · 2022-06-10T19:57:02Z

src/js/form-model.js

@@ -1561,30 +1561,18 @@ FormModel.prototype.evaluate = function (
    }

    // try native to see if that works... (will not work if the expr contains custom OpenRosa functions)
-    if (
-        tryNative &&


Note, it is unfortunately not safe to always try native because in this ecosystem we do things that can be evaluated by a native evaluator but return an incorrect result for ODK XForms. I'm surprised no tests failed though (or did they?).

The main (and perhaps only) things are comparisons and arithmetic with date/dateTime strings.
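
To illustrate why date strings are a problem, here is a rough sketch. It assumes that XPath 1.0 relational operators convert string operands to numbers, which for a date string gives NaN, and it approximates that conversion with JavaScript's `Number()`. `nativeStyleLessThan` and `dateAwareLessThan` are hypothetical helpers written for this example, not functions from Enketo or openrosa-xpath-evaluator.

```javascript
// Approximates a plain XPath 1.0 "<" on two strings: both operands become
// numbers, and a date string such as '2022-06-10' becomes NaN.
function nativeStyleLessThan(a, b) {
    return Number(a) < Number(b);
}

// Hypothetical date-aware comparison: parse both operands as dates first.
function dateAwareLessThan(a, b) {
    return Date.parse(a) < Date.parse(b);
}

// NaN < NaN is false, so the native-style comparison silently gives a
// wrong answer instead of raising an error.
console.log(nativeStyleLessThan('2022-06-10', '2022-06-11')); // false
console.log(dateAwareLessThan('2022-06-10', '2022-06-11')); // true
```

The native-style result is not an exception that could trigger a fallback, which is why simply trying native first is not safe for such expressions.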

eyelidlessness · 2022-06-10

Oh, this is a shame. I see a few other cases called out in the spec. This feels like something we could maybe revisit with the tree-sitter-xpath grammar, which is fast enough that we'd still see significant performance improvements.

MartijnR · 2022-06-10

Yes, it's a huge shame. We've had discussions about a turbo mode that requires form designers to wrap such strings with date and date-time.

Yes, indeed, good find. There are some native XPath 1.0 functions that have deviating behavior, so some of those would also be an issue.

Curious about tree-sitter! Will it be able to find out if the value of /path/to/node is a date string?

eyelidlessness · 2022-06-10

Alone, the tree-sitter grammar won't be able to do any kind of static type analysis; that's not available in the AST. Here are ways I'd imagine it could help in this case:

Rule out expressions with functions we know deviate (although we can skip some because they'll fail to compile, which should perform similarly).

Rule out expressions where we know an argument position is of date/date-time type.

Relate nodeset expressions to their bindings, to identify their type and rule out expressions with those nodesets¹.

Rule out expressions with literals which would be treated as dates.

If all of this sounds like it has overlap with openrosa-xpath-evaluator's parsing responsibilities... it does, heh. But it would probably be a good fit for this case, because tree-sitter is exceptionally fast.
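
As a rough illustration of two of the rule-out checks listed above, the sketch below uses plain regular expressions rather than the tree-sitter grammar, so it is only an approximation of what a real parser would do. `hasDateLikeLiteral` and `referencesDateTypedNode` are hypothetical names, and `bindTypes` is an assumed map from nodeset path to the bound data type.

```javascript
// Matches single- or double-quoted XPath string literals.
const STRING_LITERAL = /"([^"]*)"|'([^']*)'/g;
// Very loose check for date or date-time looking values.
const DATE_LIKE = /^\d{4}-\d{2}-\d{2}(?:T.*)?$/;

// Rule out expressions containing literals that would be treated as dates.
function hasDateLikeLiteral(expr) {
    for (const match of expr.matchAll(STRING_LITERAL)) {
        const value = match[1] !== undefined ? match[1] : match[2];
        if (DATE_LIKE.test(value)) {
            return true;
        }
    }
    return false;
}

// Rule out expressions whose nodesets are bound to a date or dateTime type.
// `nodesetPaths` would come from whatever finds nodeset sub-expressions.
function referencesDateTypedNode(nodesetPaths, bindTypes) {
    return nodesetPaths.some((path) => {
        const type = bindTypes.get(path);
        return type === 'date' || type === 'dateTime';
    });
}
```

An expression for which either check returns true would be sent to the extended evaluator rather than the native one.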

Footnotes

¹ This dovetails with other prototype work I've explored, identifying nodeset subexpressions to determine their dependencies. This already works for a huge set of expressions I pulled from openrosa-xpath-evaluator's tests; you can see the test fixtures used to validate that here. The grammar has proven pretty reliable so far. You can see example usage here and here to find nodeset sub-expressions. I have additional prototype work (currently only local) for resolving those sub-expressions to actual related nodes, which so far has worked for everything I've tried except with relative nested sub-expressions (e.g. in a predicate).

MartijnR · 2022-06-14

Cool stuff!

"I have additional prototype work (currently only local) for resolving those sub-expressions to actual related nodes, which so far has worked for everything"

I'm guessing the challenge might be to prevent this from becoming too costly (possibly negating the performance improvements achieved by sending to the native evaluator). It will be awesome if that is possible!

eyelidlessness · 2022-06-15

It’s more than possible, it’s a reality! I’ll push up instructions for running and measuring the subexpression logic when I get a chance, but for now I’ll just say that finding subexpressions in all of the cases I pulled from the current evaluator test suite averages 1ms or less, even when my computer is throttling under heavy load. In my local stress testing, the native evaluator is also generally ~1ms, while the extended evaluator is generally 10+ms.

jdugh · 2022-06-11T14:30:39Z

Hi @eyelidlessness,
Here is a form with a list of about 40 000 rows (French municipalities) and 11 calculations:
https://ee.kobotoolbox.org/Cijpflhc
The form takes a long time to load the data (be patient).

The XLSForm and data:
XLSFORM_big_list.xlsx
communes.csv
departement.csv
region.csv

A smaller list to try (~1 300 rows):
https://ee.kobotoolbox.org/1CmdNbQ0
communes_light.csv
XLSFORM_small_list.xlsx
(same departement/region)

eyelidlessness · 2022-06-15T00:39:07Z

Thank you @jdugh! I meant to reply earlier, but wound up on a yak shaving adventure trying to get the large CSV to load on my local enketo-express/ODK central setup.

That aside, this is an awesome case to add to my growing collection of performance stress tests.

Earlier today @lognaturel and I discussed a safer, more limited approach to this. Instead of always deferring to the native evaluator, or doing more complex analysis of queries, we'll likely start with a more naive analysis to optimize queries which are obviously straightforward (nodeset references, basic operators with non-ambiguous operands). This isn’t the end of the line for optimization potential I’m exploring, but it will be a big perf boost for a lot of common cases and a lot of the groundwork is already laid.
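
One way such a naive analysis could start, sketched under assumptions: only expressions that are plain location paths (element name steps, no predicates, functions or operators) are treated as safe for the native evaluator, and everything else goes to the extended one. `isSimpleNodesetReference` is a hypothetical helper, and the pattern is deliberately conservative rather than a full XPath parser.

```javascript
// Plain location path: optional leading "/", "./" or "../", then element
// name steps separated by "/". No predicates, functions or operators.
const SIMPLE_PATH =
    /^\s*(?:\/|\.{1,2}\/)?[A-Za-z_][\w.-]*(?:\/[A-Za-z_][\w.-]*)*\s*$/;

function isSimpleNodesetReference(expr) {
    return SIMPLE_PATH.test(expr);
}

console.log(isSimpleNodesetReference('/data/commune')); // true
console.log(isSimpleNodesetReference('../region')); // true
console.log(isSimpleNodesetReference('today() > /data/date')); // false
```

Anything the check rejects is not necessarily unsafe; it simply keeps using the existing evaluation path.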

MG60065 added 2 commits June 10, 2022 15:15

try to always use native approach to improve faster search when model…

a3dd5d2

… are fat

prettier fix

7ae4caa

MartijnR reviewed Jun 10, 2022
